Report Requirements: You are required to apply the knowledge acquired during the course – working with the dataset referenced on page 2 – to create a report using RStudio and RMarkdown along with the other packages studied (mainly tidyverse and tidymodels).

Task #1: Choose at least two of the following questions to answer:

- Is there a relationship between a movie's budget and its number of votes? What about the rating?
- What are the genres that have the highest average rating?
- What are the plot keywords that have the highest average rating?
- Who are the highly rated directors? Who are the highly rated actors?
- Who are the most profitable directors? Who are the most profitable actors?
- Which countries have a higher average rating? Which countries produced more movies?
- How did the number of movie rating users evolve over the years?

Task #2: Design and implement a predictive model to find the expected movie rating score using the features supplied in the dataset. You are free to choose any of the following:

- The prediction method (classification, regression, or otherwise).
- The features needed for prediction.
- The model evaluation method or criteria.

Dataset Specifications

It is a dataset from the TMDB (The Movie Database) website covering ~5,000 movie titles, separated into two files:

Movie Metadata

=====================================================================================================================

Column Name | Column Description |

=====================================================================================================================

movie_id | the movie id |
title | the movie title |
original_language | the movie language |
release_date | the movie release date |
budget | the movie budget |
revenue | the movie revenue |
runtime | the movie runtime in minutes |
vote_average | the movie TMDB average rating |
vote_count | the movie TMDB rating users count |
popularity | the movie TMDB popularity score |
genres | the movie genres separated by a pipe |
keywords | the movie keywords separated by a pipe |
production_companies | the movie companies separated by a pipe |
production_countries | the movie countries separated by a pipe |

=====================================================================================================================

Movie Cast and Crew

=====================================================================================================================

Column Name | Column Description |

=====================================================================================================================

movie_id | the movie id |
director | the movie director name |
producer | the movie producer name |
actor_1 | the movie actor_1 name |
actor_2 | the movie actor_2 name |
actor_3 | the movie actor_3 name |

=====================================================================================================================

(The script used to transform the original dataset is attached with the dataset.)

# Start From Here

library('tidyverse')
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library('rmarkdown')
library('tidymodels')
## Warning: package 'tidymodels' was built under R version 4.0.5
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom        0.7.6      v rsample      0.0.9 
## v dials        0.0.9      v tune         0.1.5 
## v infer        0.5.4      v workflows    0.2.2 
## v modeldata    0.1.0      v workflowsets 0.0.2 
## v parsnip      0.1.5      v yardstick    0.0.8 
## v recipes      0.1.16
## Warning: package 'broom' was built under R version 4.0.5
## Warning: package 'dials' was built under R version 4.0.5
## Warning: package 'scales' was built under R version 4.0.5
## Warning: package 'infer' was built under R version 4.0.5
## Warning: package 'modeldata' was built under R version 4.0.5
## Warning: package 'parsnip' was built under R version 4.0.5
## Warning: package 'recipes' was built under R version 4.0.5
## Warning: package 'rsample' was built under R version 4.0.5
## Warning: package 'tune' was built under R version 4.0.5
## Warning: package 'workflows' was built under R version 4.0.5
## Warning: package 'workflowsets' was built under R version 4.0.5
## Warning: package 'yardstick' was built under R version 4.0.5
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(readr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
library(DAAG)
## Warning: package 'DAAG' was built under R version 4.0.5
library(dplyr)

Task 1

tmdb_cast_crew <- read_csv("tmdb_cast_crew.csv", 
    col_types = cols(movie_id = col_character()))
#View(tmdb_cast_crew)

tmdb_movies_metadata <- read_csv("tmdb_movies_metadata.csv", 
    col_types = cols(movie_id = col_character(), 
        release_date = col_date(format = "%Y-%m-%d")))
#View(tmdb_movies_metadata)
glimpse(tmdb_movies_metadata)
## Rows: 4,795
## Columns: 14
## $ movie_id             <chr> "19995", "285", "206647", "49026", "49529", "559"~
## $ title                <chr> "Avatar", "Pirates of the Caribbean: At World's E~
## $ original_language    <chr> "en", "en", "en", "en", "en", "en", "en", "en", "~
## $ release_date         <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012-07-16, ~
## $ budget               <dbl> 2.37e+08, 3.00e+08, 2.45e+08, 2.50e+08, 2.60e+08,~
## $ revenue              <dbl> 2787965087, 961000000, 880674609, 1084939099, 284~
## $ runtime              <dbl> 162, 169, 148, 165, 132, 139, 100, 141, 153, 151,~
## $ vote_average         <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7.4, 5.7,~
## $ vote_count           <dbl> 11800, 4500, 4466, 9106, 2124, 3576, 3330, 6767, ~
## $ popularity           <dbl> 150.43758, 139.08262, 107.37679, 112.31295, 43.92~
## $ genres               <chr> "Action|Adventure|Fantasy|Science Fiction", "Adve~
## $ keywords             <chr> "culture clash|future|space war|space colony|soci~
## $ production_companies <chr> "Ingenious Film Partners|Twentieth Century Fox Fi~
## $ production_countries <chr> "United States of America|United Kingdom", "Unite~
glimpse(tmdb_cast_crew)
## Rows: 4,803
## Columns: 6
## $ movie_id <chr> "19995", "285", "206647", "49026", "49529", "559", "38757", "~
## $ director <chr> "James Cameron", "Gore Verbinski", "Sam Mendes", "Christopher~
## $ producer <chr> "James Cameron", "Jerry Bruckheimer", "Barbara Broccoli", "Ch~
## $ actor_1  <chr> "Sam Worthington", "Johnny Depp", "Daniel Craig", "Christian ~
## $ actor_2  <chr> "Zoe Saldana", "Orlando Bloom", "Christoph Waltz", "Michael C~
## $ actor_3  <chr> "Sigourney Weaver", "Keira Knightley", "Léa Seydoux", "Gary O~

Data Wrangling

Here we are going to do 2 steps:

1- Assessing data: explore the data and comment on all the weird, wrong, and missing values in it.

2- Cleaning data: clean the data based on the comments we wrote.

1 - Assessing Data

Explore data

tmdb_movies_metadata
tmdb_cast_crew
summary(tmdb_movies_metadata)
##    movie_id            title           original_language   release_date       
##  Length:4795        Length:4795        Length:4795        Min.   :1916-09-04  
##  Class :character   Class :character   Class :character   1st Qu.:1999-07-03  
##  Mode  :character   Mode  :character   Mode  :character   Median :2005-09-30  
##                                                           Mean   :2002-12-23  
##                                                           3rd Qu.:2011-02-14  
##                                                           Max.   :2017-02-03  
##                                                           NA's   :1           
##      budget             revenue             runtime       vote_average   
##  Min.   :        0   Min.   :0.000e+00   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:   800000   1st Qu.:0.000e+00   1st Qu.: 94.0   1st Qu.: 5.600  
##  Median : 15000000   Median :1.926e+07   Median :104.0   Median : 6.200  
##  Mean   : 29092674   Mean   :8.240e+07   Mean   :106.9   Mean   : 6.093  
##  3rd Qu.: 40000000   3rd Qu.:9.312e+07   3rd Qu.:118.0   3rd Qu.: 6.800  
##  Max.   :380000000   Max.   :2.788e+09   Max.   :338.0   Max.   :10.000  
##                                          NA's   :2                       
##    vote_count        popularity         genres            keywords        
##  Min.   :    0.0   Min.   :  0.000   Length:4795        Length:4795       
##  1st Qu.:   54.0   1st Qu.:  4.724   Class :character   Class :character  
##  Median :  236.0   Median : 12.963   Mode  :character   Mode  :character  
##  Mean   :  691.4   Mean   : 21.527                                        
##  3rd Qu.:  738.0   3rd Qu.: 28.352                                        
##  Max.   :13752.0   Max.   :875.581                                        
##                                                                           
##  production_companies production_countries
##  Length:4795          Length:4795         
##  Class :character     Class :character    
##  Mode  :character     Mode  :character    
##                                           
##                                           
##                                           
## 
summary(tmdb_cast_crew)
##    movie_id           director           producer           actor_1         
##  Length:4803        Length:4803        Length:4803        Length:4803       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    actor_2            actor_3         
##  Length:4803        Length:4803       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Check NA values

sum(is.na(tmdb_cast_crew))
## [1] 1212


Comments: there are 1,212 NA values across the 4,803 rows of tmdb_cast_crew.

colSums(is.na(tmdb_cast_crew))
## movie_id director producer  actor_1  actor_2  actor_3 
##        0       30     1023       43       53       63


Comments: imputing the missing values here would not be ideal. Most of them are concentrated in the producer column, so we can drop that column and then remove the remaining NA rows.


Check NA values

sum(is.na(tmdb_movies_metadata))
## [1] 956


Comments: there are 956 NA values across the 4,795 rows of tmdb_movies_metadata. We can impute or remove these missing values.

colSums(is.na(tmdb_movies_metadata))
##             movie_id                title    original_language 
##                    0                    0                    0 
##         release_date               budget              revenue 
##                    1                    0                    0 
##              runtime         vote_average           vote_count 
##                    2                    0                    0 
##           popularity               genres             keywords 
##                    0                   28                  409 
## production_companies production_countries 
##                  345                  171


Comments: three columns have many NA values: keywords, production_companies, and production_countries. These columns are not that important for our questions, so we can simply drop them and then remove the remaining NA rows.

Check duplicates in tmdb_cast_crew and tmdb_movies_metadata

sum(duplicated(tmdb_cast_crew))
## [1] 0

comments:
no duplicates.

sum(duplicated(tmdb_movies_metadata))
## [1] 0

comments:
no duplicates.

Check unique values and the number of occurrences

#length(unique(df))
apply(tmdb_cast_crew, 2, function(x) length(unique(x)))
## movie_id director producer  actor_1  actor_2  actor_3 
##     4803     2352     1767     2096     2721     3096
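A tidyverse equivalent of the apply() call above, which avoids the implicit coercion of the tibble to a character matrix, could look like this (a sketch using dplyr's across()):

```r
# Count distinct values in every column of the tibble
tmdb_cast_crew %>%
  summarise(across(everything(), n_distinct))
```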

Check unique values and the number of occurrences

#length(unique(df))
apply(tmdb_movies_metadata, 2, function(x) length(unique(x)))
##             movie_id                title    original_language 
##                 4795                 4792                   37 
##         release_date               budget              revenue 
##                 3279                  434                 3296 
##              runtime         vote_average           vote_count 
##                  157                   71                 1609 
##           popularity               genres             keywords 
##                 4794                 1175                 4218 
## production_companies production_countries 
##                 3695                  469

2 - Cleaning Data

Steps I will apply:

1- Remove columns: keywords, production_companies, and production_countries from tmdb_movies_metadata, and producer from tmdb_cast_crew.

2- Remove the remaining NA values.

metadata = subset(tmdb_movies_metadata, select = -c(keywords,production_companies,production_countries))
crew = subset(tmdb_cast_crew, select = -c(producer))

metadata <- na.omit(metadata) 
crew <- na.omit(crew) 
colSums(is.na(metadata))
##          movie_id             title original_language      release_date 
##                 0                 0                 0                 0 
##            budget           revenue           runtime      vote_average 
##                 0                 0                 0                 0 
##        vote_count        popularity            genres 
##                 0                 0                 0
colSums(is.na(crew))
## movie_id director  actor_1  actor_2  actor_3 
##        0        0        0        0        0
metadata

3 - Check Outliers

boxplot( metadata$budget ,
main = "budget",
at = c(1),
names = c( "budget"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)

boxplot( metadata$revenue ,
main = "revenue",
at = c(1),
names = c( "revenue"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)

boxplot( metadata$runtime ,
main = "runtime",
at = c(1),
names = c( "runtime"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)

boxplot(metadata$vote_average ,
main = "vote_average",
at = c(1),
names = c("vote_average"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)

boxplot(metadata$vote_count ,
main = "vote_count",
at = c(1),
names = c("vote_count"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)


boxplot( metadata$popularity ,
main = "popularity",
at = c(1),
names = c( "popularity"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)

The boxplots show that these columns have outliers, but we will keep them.

After assessing and cleaning the data, we can start answering the following questions:

metadata

What are the genres that have the highest average rating?

metadata %>% 
  group_by(genres) %>% 
  summarise(avg_rating = mean(vote_average)) %>%
  filter(avg_rating == max(avg_rating))
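Note that genres is a pipe-separated string, so grouping on the raw column treats every genre combination (e.g. "Action|Adventure|Fantasy") as its own category. A sketch that first splits genres into individual rows with tidyr::separate_rows() before summarising (column names avg_rating and n_movies are my own):

```r
# One row per (movie, genre), then the mean rating per individual genre
metadata %>%
  separate_rows(genres, sep = "\\|") %>%
  group_by(genres) %>%
  summarise(avg_rating = mean(vote_average), n_movies = n()) %>%
  arrange(desc(avg_rating))
```

Keeping n_movies alongside the average helps spot genres whose high rating rests on only a handful of titles.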

Who are the highly rated directors? Who are the highly rated actors?

metadata %>% inner_join(crew) -> df
## Joining, by = "movie_id"
df %>% 
  group_by (director) %>% 
  summarise(max_rated_director = max(vote_average)) %>%
  filter(max(max_rated_director) == max_rated_director) %>%
  select (director)
df %>% 
  group_by (actor_1) %>% 
  summarise(max_rated_actor_1 = max(vote_average)) %>%
  filter(max(max_rated_actor_1) == max_rated_actor_1) %>%
  select (actor_1)
df %>% 
  group_by (actor_2) %>% 
  summarise(max_rated_actor_2 = max(vote_average)) %>%
  filter(max(max_rated_actor_2) == max_rated_actor_2) %>%
  select (actor_2)
df %>% 
  group_by (actor_3) %>% 
  summarise(max_rated_actor_3 = max(vote_average)) %>%
  filter(max(max_rated_actor_3) == max_rated_actor_3) %>%
  select (actor_3)
df %>%
  select(budget,revenue, runtime , vote_average, vote_count, popularity) %>%
  mutate(budget = (budget - mean(budget))/sd(budget)) %>%
  mutate(revenue = (revenue - mean(revenue))/sd(revenue)) %>%
  mutate(runtime = (runtime - mean(runtime))/sd(runtime)) %>%
  mutate(vote_average = (vote_average - mean(vote_average))/sd(vote_average)) %>%
  mutate(vote_count = (vote_count - mean(vote_count))/sd(vote_count)) %>%
  mutate(popularity = (popularity - mean(popularity))/sd(popularity)) -> df_scaled
df_scaled
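The six repeated mutate() calls above can be collapsed into one step with across() (available in dplyr >= 1.0); a sketch producing the same standardized columns:

```r
# Standardize all selected columns in a single mutate()
df %>%
  select(budget, revenue, runtime, vote_average, vote_count, popularity) %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
```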
ggplot(df_scaled,
       aes(x = vote_average, 
           y = budget)) + 
  geom_point() 

As the budget increases, the vote average tends to increase; there is a weak positive linear relationship between the two features.

ggplot(df_scaled,
       aes(x = vote_average, 
           y = revenue)) + 
  geom_point() 

As the revenue increases, the vote average tends to increase; there is a weak positive linear relationship between the two features.

ggplot(df_scaled,
       aes(x = vote_average, 
           y = runtime
)) + 
  geom_point() 

As the runtime increases, the vote average tends to increase; there is a weak positive linear relationship between the two features.

ggplot(df_scaled,
       aes(x = vote_average, 
           y = vote_count)) + 
  geom_point() 

As the vote count increases, the vote average tends to increase; there is a weak positive linear relationship between the two features.

ggplot(df_scaled,
       aes(x = vote_average, 
           y = popularity)) + 
  geom_point() 

As the popularity increases, the vote average tends to increase; there is a weak positive linear relationship between the two features.

Task 2

#lets see the correlation between the independent variables

cor(df_scaled[,names(df_scaled)!="vote_average"])
##               budget   revenue   runtime vote_count popularity
## budget     1.0000000 0.7293913 0.2648640  0.5904653  0.5013138
## revenue    0.7293913 1.0000000 0.2504608  0.7804131  0.6427591
## runtime    0.2648640 0.2504608 1.0000000  0.2710457  0.2189417
## vote_count 0.5904653 0.7804131 0.2710457  1.0000000  0.7767158
## popularity 0.5013138 0.6427591 0.2189417  0.7767158  1.0000000

Several predictors are strongly correlated with each other (e.g., budget and revenue at 0.73, revenue and vote_count at 0.78, vote_count and popularity at 0.78), so multicollinearity could be a concern for a linear model.
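Given these pairwise correlations, caret's findCorrelation() can flag redundant predictors before modeling; a sketch (the 0.75 cutoff is an assumed heuristic, not from the report):

```r
# Flag predictors whose pairwise correlation exceeds the cutoff
cor_mat <- cor(df_scaled[, names(df_scaled) != "vote_average"])
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)
```

Any names returned are candidates to drop from a linear model to reduce multicollinearity.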

df_scaled %>% 
  select(budget,revenue, runtime ,vote_count ,popularity, vote_average) ->df_scaled

Splitting dataset

Below we split the df_scaled dataset so that 80% is used for training a linear regression model and 20% is used to evaluate the model's performance.

# Split the data into training and test set
set.seed(123)
training.samples <- df_scaled$revenue %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- df_scaled[training.samples, ]
test.data <- df_scaled[-training.samples, ]
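Note that createDataPartition() above stratifies on revenue; stratifying on the outcome is more common. Since tidymodels is already loaded, the same split could also be done with rsample (a sketch, not from the original report):

```r
# rsample alternative to caret::createDataPartition;
# strata = vote_average stratifies the split on the outcome
set.seed(123)
data_split <- initial_split(df_scaled, prop = 0.8, strata = vote_average)
train_data <- training(data_split)
test_data  <- testing(data_split)
```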

Building Model

# Build the model
model <- lm(vote_average ~., data = train.data)

Make Predictions

# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)

# Evaluate the Model Using R2, RMSE, and MAE

data.frame( R2 = R2(predictions, test.data$vote_average),
            RMSE = RMSE(predictions, test.data$vote_average),
            MAE = MAE(predictions, test.data$vote_average))

Comment: we can see that the model does not fit the data very well, as the R² is low, at about 18%.

The prediction error rate (RMSE divided by the mean of the outcome), which should be as small as possible:

RMSE(predictions, test.data$vote_average)/mean(test.data$vote_average)
## [1] 36.13323

It looks very high, but note that vote_average was standardized, so its mean is close to zero, which inflates this ratio; comparing the RMSE to the standard deviation of the standardized outcome (which is 1) is more informative.

# K-fold Cross-Validation

# Define training control
set.seed(123) 
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(vote_average ~., data = df_scaled, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 4715 samples
##    5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4243, 4245, 4244, 4242, 4243, 4243, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.8848452  0.2180938  0.6040948
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
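Since the report also loads tidymodels, the same 10-fold cross-validation can be expressed with rsample and tune; a sketch that was not part of the original analysis:

```r
# tidymodels version of the 10-fold CV above
set.seed(123)
folds <- vfold_cv(df_scaled, v = 10)
lm_spec <- linear_reg() %>% set_engine("lm")
cv_results <- fit_resamples(lm_spec, vote_average ~ ., resamples = folds)
collect_metrics(cv_results)  # mean RMSE and R-squared across folds
```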